ABSTRACT


This paper extends the basic work that has been done on zero-sum
stochastic games to those that are nonzero-sum. Appropriately defined
equilibrium points are shown to exist for both the case where the players
seek to maximize the total value of their discounted period rewards and the
case where they wish to maximize their average reward per period. For the
latter case, conditions required on the structure of the Markov chains are
less stringent than those imposed in previous work on zero-sum stochastic
games. Extensions to n-person games and underlying semi-Markov processes are
discussed, and finding an equilibrium point is shown to be equivalent to
solving a certain nonlinear programming problem.

CHAPTER 1


INTRODUCTION 


A stochastic game combines a finite state, discrete time sequential
decision process with two person game theory in the following way: at time
$n$, two players are jointly in some state $i$, $i = 1, \ldots, N$, in which
they play a bimatrix game $[A^i, B^i]$. If the players choose row $k$ and
column $\ell$ respectively, then $a^i_{k\ell}$ is the reward to player I and
$b^i_{k\ell}$ the reward to player II. The players' choices also determine
$p^{k\ell}_{ij}$, the probability that the players move from state $i$ to
state $j$ at time $n + 1$, $j = 1, \ldots, N$.


A stationary strategy for player I in state $i$ is a probability vector
$x_i = (x_{i1}, x_{i2}, \ldots, x_{iK_i})$ where $x_{ik}$ is the probability
that player I chooses the kth row, and player I uses $x_i$ whenever in state
$i$. Similarly, a stationary strategy for player II in state $i$ is a
probability vector $y_i = (y_{i1}, y_{i2}, \ldots, y_{iL_i})$ where
$y_{i\ell}$ is the probability that player II chooses the $\ell$th column,
and player II uses $y_i$ whenever in state $i$. If the players have chosen
strategies $x_i$ and $y_i$, then player I's expected reward for period $n$ is

$$\sum_{k=1}^{K_i} \sum_{\ell=1}^{L_i} a^i_{k\ell}\, x_{ik}\, y_{i\ell}$$

and player II's expected reward is

$$\sum_{k=1}^{K_i} \sum_{\ell=1}^{L_i} b^i_{k\ell}\, x_{ik}\, y_{i\ell} .$$


At time $n + 1$, the players will be in state $j$ with probability

$$\sum_{k=1}^{K_i} \sum_{\ell=1}^{L_i} p^{k\ell}_{ij}\, x_{ik}\, y_{i\ell} ,$$

where the bimatrix game $[A^j, B^j]$ will be played, stationary strategies
$x_j$, $y_j$ employed, and a transition to a new state made. The game
continues in this manner over an infinite horizon, the movement of the
players being governed by the Markov chain $(p_{ij})$.
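
The one-period quantities just described can be sketched in a few lines of Python. This is only an illustration under assumed data: the function name `stage_outcome`, the nested-list layout, and the numbers in the example state are hypothetical and do not come from the paper.

```python
def stage_outcome(A_i, B_i, P_i, x_i, y_i):
    """Expected one-period rewards and next-state distribution in state i.

    A_i[k][l], B_i[k][l] : rewards a^i_{kl} and b^i_{kl}
    P_i[k][l][j]         : transition probability p^{kl}_{ij}
    x_i, y_i             : mixed actions of players I and II in state i
    """
    n_states = len(P_i[0][0])
    reward_I = 0.0
    reward_II = 0.0
    next_state = [0.0] * n_states
    for k, xk in enumerate(x_i):
        for l, yl in enumerate(y_i):
            w = xk * yl  # probability that row k and column l are both chosen
            reward_I += A_i[k][l] * w
            reward_II += B_i[k][l] * w
            for j in range(n_states):
                next_state[j] += P_i[k][l][j] * w
    return reward_I, reward_II, next_state


# Hypothetical 2 x 2 bimatrix game played in one state of a two-state process.
A1 = [[3.0, 0.0], [1.0, 2.0]]
B1 = [[1.0, 0.0], [0.0, 4.0]]
P1 = [[[1.0, 0.0], [0.5, 0.5]],
      [[0.0, 1.0], [0.25, 0.75]]]
r_I, r_II, dist = stage_outcome(A1, B1, P1, [0.5, 0.5], [0.5, 0.5])
```

Applying the same function in every state yields the reward vectors and the Markov chain that a pair of stationary strategies induces.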

There are several possibilities for the objectives of the two players.
We will first study the case in which the players seek stationary strategies
$x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$ respectively which will
uniformly maximize, for all initial states, the discounted value of their
total expected rewards. Then we will examine the case in which the players
desire to maximize their expected reward per period, and seek stationary
strategies to do so. These will be referred to as the discounted case and
average rate of return case, respectively. It is clear that what is good for
one player may be bad for the other, so it will generally be impossible for
both players to simultaneously achieve these objectives (in the zero-sum
game, this is always the case since with $A^i = -B^i$, the players have
directly opposing interests). Hence, we turn to the concepts of a "value" for
a zero-sum stochastic game and an "equilibrium point" for a nonzero-sum
stochastic game, discussed in Chapter 2.


The literature on stochastic games is not extensive. The first article
appeared in 1953, when L. S. Shapley [15] first described the game. Shapley
proved the existence of an appropriately defined value for a zero-sum game
with total discounted rewards as the payoffs. He showed that an optimal
strategy that achieves the value can be taken to be stationary, i.e., the
players can use the same strategies every time they are in state $i$,
independent of the time period in which they arrive in state $i$, and he
provided an algorithm for the determination of optimal strategies and the
value.

The average rate of return zero-sum game was treated by D. Gillette [4]
in 1957. Whereas the structure of the Markov chain governing the transitions
of the players can be arbitrary in the discounted zero-sum game, Gillette
showed that this is not the case when average rate of return is the objective
if we hope to have stationary strategies yield a value for the game. He
accomplished this by proving that if all possible underlying Markov chains
are irreducible, then a value exists and can be achieved by stationary
strategies, and he gave an example of a game having a reducible chain for
which a value could not be attained by stationary strategies.

Gillette's results were rederived from a linear programming approach by
Hoffman and Karp [6]. Their results still required the irreducibility
assumption. In addition, they presented an algorithm which converges to
stationary strategies yielding the value of the game.

The results that follow generalize those above to nonzero-sum stochastic
games and provide a relaxation of the irreducibility assumption in the
average rate of return case. Following the work of Nash [13] on nonzero-sum
games, the existence of appropriately defined equilibrium points for
nonzero-sum stochastic games is proven for both the discounted and average
rate of return games. In the latter case, the irreducibility assumption is
weakened to allow for some transient states as long as every possible
underlying chain has a single ergodic subclass of states. For the average
rate of return case, finding an equilibrium point is shown to be equivalent
to solving a nonlinear programming problem, and extensions to n-person games
and underlying semi-Markov processes are discussed. As a byproduct of these
efforts in the discounted case and average rate of return case with
irreducible chains, we get a characterization of the set of stationary
optimal policies for a sequential decision process, the process that results
from letting one of the players be a "dummy" with only one possible action
available in each state.

CHAPTER 2
DISCOUNTED CASE


2.1 Introduction

Since a nonzero-sum stochastic game can be viewed as the marriage of a
nonzero-sum game and a discrete dynamic programming problem (sequential
decision process), it comes as little surprise that the major results for
such games depend heavily on the results and structure of both these subjects.
Underlying this relationship is the fact that the major element in proving
the existence of equilibrium points for nonzero-sum games is the character
of the set of optimal strategies for one player when opposing a given
stationary strategy of the other.† But in a stochastic game, when a player's
opponent fixes his strategy, the player is faced with precisely a sequential
decision process.

In the following two sections, reviews of Nash's work on nonzero-sum
games [13] and discrete dynamic programming will be presented and notation
set up. Then, in 2.4, the results from these areas will be put together to
establish the existence of an equilibrium point for a nonzero-sum stochastic
game with expected discounted totals the objective.

2.2  Bimatrix  Games 

Consider a two-person nonzero-sum bimatrix game $[A, B]$, where $A$ and
$B$ are $K \times L$ matrices, $K, L < \infty$, $(A)_{k\ell} = a_{k\ell}$,
$(B)_{k\ell} = b_{k\ell}$. Player I (the "row player") has $K$ pure
strategies $e_1, e_2, \ldots, e_K$ where $e_k$ is the kth unit vector and
player I's use of $e_k$ represents his choice of the kth row of the matrices
$A$ and $B$ with probability 1. Similarly, player II (the "column player")
has $L$ pure strategies $e_1, e_2, \ldots, e_L$ where player II's use of
$e_\ell$ represents his choice of the $\ell$th column of $A$ and $B$ with
probability 1. Corresponding to each pair of pure strategies
$(e_k, e_\ell)$, one strategy being taken for each player, are the rewards
$a_{k\ell}$ and $b_{k\ell}$ to players I and II respectively. Mixed
strategies $x = (x_1, x_2, \ldots, x_K)$ and $y = (y_1, y_2, \ldots, y_L)$
represent probability distributions over the choices of pure strategies for
the players, and when employed, result in expected reward

$$xAy = \sum_{k=1}^{K} \sum_{\ell=1}^{L} a_{k\ell}\, x_k\, y_\ell$$

for player I and expected reward

$$xBy = \sum_{k=1}^{K} \sum_{\ell=1}^{L} b_{k\ell}\, x_k\, y_\ell$$

for player II.

† The games considered are two person unless otherwise indicated.

A pair of strategies $(x^\circ, y^\circ)$ is said to be an "equilibrium
point" if $x^\circ$ maximizes $xAy^\circ$ and $y^\circ$ maximizes
$x^\circ By$. The appealing aspect of an equilibrium point is the stability
of such a point in the sense that each player can do no better than to use
his equilibrium strategy when opposing the equilibrium strategy of the
other. (For a discussion of equilibrium points, their properties and
drawbacks, see Luce and Raiffa [10].)

Nash set up the problem of establishing the existence of an equilibrium
point for the above game by forming a closely associated correspondence whose
fixed points are precisely the equilibrium points for the game. Let

$$X = \Big\{ x \;\Big|\; x \in E^K,\ \sum_{k=1}^{K} x_k = 1,\ x_k \ge 0 \Big\}$$

be player I's strategy space,

$$Y = \Big\{ y \;\Big|\; y \in E^L,\ \sum_{\ell=1}^{L} y_\ell = 1,\ y_\ell \ge 0 \Big\}$$

be player II's strategy space,

$$\phi_1(y) = \Big\{ x \;\Big|\; \max_{x \in X} xAy = xAy \Big\}, \qquad
\phi_2(x) = \Big\{ y \;\Big|\; \max_{y \in Y} xBy = xBy \Big\} .$$

Note $\phi_1 \times \phi_2 : Y \times X \to 2^{Y \times X}$.

Now $(x^\circ, y^\circ) \in \phi_1(y^\circ) \times \phi_2(x^\circ)$ if and
only if $x^\circ \in \phi_1(y^\circ)$ and $y^\circ \in \phi_2(x^\circ)$.
Hence $(x^\circ, y^\circ)$ is an equilibrium point of the game $[A, B]$ if
and only if $(x^\circ, y^\circ)$ is a fixed point of
$\phi_1 \times \phi_2$. Having established the correspondence between
equilibrium points of the game and fixed points of the correspondence
$\phi_1 \times \phi_2$, it only remains to prove the existence of a fixed
point for $\phi_1 \times \phi_2$. Since $X$ and $Y$ are nonempty, compact
and convex, this can be accomplished by Kakutani's fixed point theorem [9],
which requires that $\phi_1 \times \phi_2$ have a closed graph‡ and that
$\phi_1(y) \times \phi_2(x)$ be convex and nonempty for all
$(y, x) \in Y \times X$, all of which hold.
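
The fixed-point condition can be checked numerically for a candidate pair. Because $xAy^\circ$ is linear in $x$, its maximum over $X$ is attained at a pure strategy, so it suffices to compare the candidate's payoff with the best pure-strategy payoff; the same holds for player II. The sketch below is illustrative only: the helper names and the 2 x 2 example game are assumptions, not taken from the paper.

```python
def payoff(x, M, y):
    """Bilinear form xMy for mixed strategies x, y and payoff matrix M."""
    return sum(x[k] * M[k][l] * y[l]
               for k in range(len(x)) for l in range(len(y)))


def is_equilibrium(A, B, x, y, tol=1e-9):
    """True if x is a best reply to y under A and y is a best reply to x under B."""
    K, L = len(A), len(A[0])
    # Best pure-strategy payoffs; no mixed strategy can do better (linearity).
    best_I = max(sum(A[k][l] * y[l] for l in range(L)) for k in range(K))
    best_II = max(sum(x[k] * B[k][l] for k in range(K)) for l in range(L))
    return (payoff(x, A, y) >= best_I - tol and
            payoff(x, B, y) >= best_II - tol)


# Hypothetical coordination game with two pure and one mixed equilibrium.
A = [[2.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
pure_ok = is_equilibrium(A, B, [1.0, 0.0], [1.0, 0.0])
mixed_ok = is_equilibrium(A, B, [2 / 3, 1 / 3], [1 / 3, 2 / 3])
not_ok = is_equilibrium(A, B, [1.0, 0.0], [0.0, 1.0])
```

The check verifies a given pair; it does not search for equilibria, whose existence is what the fixed point theorem supplies.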


2.3  Sequential  Decision  Processes 


Consider the classical sequential decision process with an infinite
planning horizon and discount factor $\beta$,§ $0 < \beta < 1$. At the
beginning of period $n$ ($n = 1, 2, \ldots$), a player (decision maker)
finds himself in one of a finite number of states $\{1, 2, \ldots, N\}$, say
$i$, and is faced with choosing one of a finite number of actions
$\{1, 2, \ldots, K_i\}$. As a consequence of choosing action $k$, the player
experiences an immediate expected reward, $r_{ik}$, and a transition to a
new state $j$, the latter occurring with probability $p^k_{ij}$,
$\sum_{j=1}^{N} p^k_{ij} = 1$. Note that both his reward and the
probabilities governing his movement depend on the state he's in ($i$) and
the action he chooses ($k$).

A randomized stationary strategy $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK_i})$,
$i = 1, 2, \ldots, N$, is simply a set of $N$ probability vectors where,
every time the player is in state $i$, $x_{ik}$ is the probability that he
chooses action $k$. It follows that the use of $x_i$ in state $i$ will
result in an immediate expected reward

$$r_i(x) = \sum_{k=1}^{K_i} r_{ik}\, x_{ik}$$

and a transition to a new state $j$ with probability

$$\sum_{k=1}^{K_i} p^k_{ij}\, x_{ik} .$$

‡ A correspondence $\phi : U \to V$ is said to have a closed graph if for
every sequence $u^n \to u^\circ$ and $v^n \to v^\circ$ with
$v^n \in \phi(u^n)$ for all $n$, we have $v^\circ \in \phi(u^\circ)$.

§ $\beta^n$ is the present value of a unit reward earned $n$ periods in the
future.

Hereafter,  the  word  "strategy"  will  mean  "stationary  strategy." 

$V(x)$ is defined to be a column vector whose ith component, $V_i(x)$,
is the expected total reward over all future time, discounted to the
beginning of a period when the player is in state $i$, and strategy $x$ is
employed. It is clear that $V(x)$ satisfies

(1)  $V(x) = r(x) + \beta P(x) V(x)$

where $r(x)$ is a column vector whose ith component, $r_i(x)$, is the
immediate expected reward in state $i$ and $P(x)$ is the Markov chain,
whose ith row, $P_i(x)$, governs transitions from state $i$, when strategy
$x$ is employed. From (1) we get

(2)  $V(x) = [I - \beta P(x)]^{-1} r(x)$,

the existence of the inverse of $[I - \beta P(x)]$ being guaranteed since
$0 < \beta < 1$.

The strategy $x^*$ is said to be optimal if it maximizes $V(x)$, i.e.,
if for any strategy $x$, $V_i(x^*) \ge V_i(x)$, $i = 1, \ldots, N$. It is
well known that in the class of randomized strategies, such an optimal
strategy exists. (See Hadley [5].)
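
For a given strategy, the value vector of equations (1) and (2) can be approximated without forming the inverse, by iterating $V \leftarrow r(x) + \beta P(x) V$, which converges because $\beta < 1$. The sketch below swaps in this successive approximation in place of the matrix inversion in (2); the function names and the two-state data are hypothetical.

```python
def mix(r_ik, p_ik, x):
    """Reward vector r(x) and chain P(x) induced by a randomized strategy x.

    r_ik[i][k]    : immediate expected reward for action k in state i
    p_ik[i][k][j] : probability of moving from i to j under action k
    x[i][k]       : probability of choosing action k in state i
    """
    n = len(x)
    r = [sum(r_ik[i][k] * x[i][k] for k in range(len(x[i]))) for i in range(n)]
    P = [[sum(p_ik[i][k][j] * x[i][k] for k in range(len(x[i])))
          for j in range(n)] for i in range(n)]
    return r, P


def discounted_value(r, P, beta, tol=1e-12, max_iter=100000):
    """Approximate the solution V of V = r + beta * P * V by iteration."""
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [r[i] + beta * sum(P[i][j] * V[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical process: state 0 has two actions, state 1 is absorbing.
r_ik = [[1.0, 2.0], [0.0]]
p_ik = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0]]]
V_first = discounted_value(*mix(r_ik, p_ik, [[1.0, 0.0], [1.0]]), 0.9)
V_second = discounted_value(*mix(r_ik, p_ik, [[0.0, 1.0], [1.0]]), 0.9)
```

In this example the second pure strategy yields the larger value in state 0 (2 against 1/0.55), and both give 0 in the absorbing state.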
2.4 Existence of Equilibrium Points in Discounted Case

Using the method of 2.2 and the structure of the sequential decision
process of 2.3, we wish to prove that an equilibrium point in stationary
strategies exists for the discounted case of a nonzero-sum stochastic game.
In order to establish this result, it will be useful to show explicitly that
when player II uses some fixed stationary strategy $y$, player I is faced
with exactly the sequential decision process discussed in 2.3. To see this,
suppose player II employs $y$. Then if player I chooses action $k$ when in
state $i$, his immediate reward will be

$$\sum_{\ell=1}^{L_i} a^i_{k\ell}\, y_{i\ell}$$

and the players will move to state $j$ with probability

$$\sum_{\ell=1}^{L_i} p^{k\ell}_{ij}\, y_{i\ell} ,$$

exactly the situation of a sequential decision process. Player I's total
discounted reward vector now depends upon the strategy of his opponent, $y$,
as well as his own and he will seek to maximize
$V(x,y) = [I - \beta P(x,y)]^{-1} a(x,y)$. Similar comments apply to player
II and his attempt to maximize his total value vector
$W(x,y) = [I - \beta P(x,y)]^{-1} b(x,y)$, when player I uses strategy $x$.
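
The reduction described above, from a stochastic game with player II's strategy fixed to a sequential decision process for player I, amounts to averaging rewards and transition probabilities over player II's columns. A minimal sketch follows, with a hypothetical function name and hypothetical two-state data.

```python
def induced_process_for_player_I(A, P, y):
    """Decision process faced by player I when player II is fixed at y.

    A[i][k][l]    : reward a^i_{kl} to player I
    P[i][k][l][j] : transition probability p^{kl}_{ij}
    y[i][l]       : probability that player II picks column l in state i
    Returns r[i][k] and p[i][k][j] in the notation of Section 2.3.
    """
    n = len(A)
    r = [[sum(A[i][k][l] * y[i][l] for l in range(len(y[i])))
          for k in range(len(A[i]))] for i in range(n)]
    p = [[[sum(P[i][k][l][j] * y[i][l] for l in range(len(y[i])))
           for j in range(n)] for k in range(len(A[i]))] for i in range(n)]
    return r, p


# Hypothetical game: a 2 x 2 bimatrix game in state 0, a trivial state 1.
A = [[[3.0, 0.0], [1.0, 2.0]], [[0.0]]]
P = [[[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.25, 0.75]]],
     [[[0.0, 1.0]]]]
y = [[0.5, 0.5], [1.0]]
r, p = induced_process_for_player_I(A, P, y)
```

The returned `r` and `p` have the shape of the $r_{ik}$ and $p^k_{ij}$ of Section 2.3, so any method for that process applies to player I's problem unchanged.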

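Let

$$X = \Big\{ x = (x_1, x_2, \ldots, x_N) \;\Big|\; \sum_{k=1}^{K_i} x_{ik} = 1,\ x_{ik} \ge 0 \Big\}$$

be player I's strategy space,

$$Y = \Big\{ y = (y_1, y_2, \ldots, y_N) \;\Big|\; \sum_{\ell=1}^{L_i} y_{i\ell} = 1,\ y_{i\ell} \ge 0 \Big\}$$

be player II's strategy space,

$$\theta_1(y) = \Big\{ x \;\Big|\; \max_{x \in X} V(x,y) = V(x,y) \Big\}, \qquad
\theta_2(x) = \Big\{ y \;\Big|\; \max_{y \in Y} W(x,y) = W(x,y) \Big\},$$

$$\theta_1 \times \theta_2 : Y \times X \to 2^{Y \times X} .$$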


Definition: 

The pair of strategies $(x^\circ, y^\circ)$ is said to be an equilibrium
point if $x^\circ \in \theta_1(y^\circ)$ and
$y^\circ \in \theta_2(x^\circ)$.

For any pair of strategies, $(x,y)$, both $\theta_1(y)$ and $\theta_2(x)$
are nonempty. So following 2.2, if it can be shown that
$\theta_1 \times \theta_2$ is convex and has a closed graph, then Kakutani's
fixed point theorem can be applied and the existence of an equilibrium point
established.


Lemma 1:

$\theta_1 : Y \to 2^X$ and $\theta_2 : X \to 2^Y$ have closed graphs.

Proof:

A sufficient condition for $\theta_1$ to have a closed graph is the
continuity of $V(x,y)$. This is assured since the inverse of
$[I - \beta P(x,y)]$ always exists and its elements are ratios of
polynomials involving the $x_{ik}$ and $y_{i\ell}$, while the elements of
$a(x,y)$ are just bilinear terms in the $x_{ik}$ and $y_{i\ell}$. An
identical argument on $W(x,y)$ and $b(x,y)$ yields the closed graph nature
of $\theta_2$.

Lemma 2:

$\theta_1(y)$ can be characterized as a closed convex polyhedron whose
extreme points constitute the set of pure strategies that are optimal
responses to $y$.

Proof:

Since player I is faced with a sequential decision process when player
II fixes his strategy at $y$, it is necessary and sufficient to show that,
for a sequential decision process: (a) any convex combination of optimal
pure strategies is an optimal randomized strategy and (b) if some randomized
strategy is optimal, then it is a convex combination of optimal pure
strategies.

(a) Suppose $x^1$ and $x^2$ are pure strategies that are optimal. We
want to show that $x^\lambda = \lambda x^1 + (1 - \lambda) x^2$,
$0 < \lambda < 1$, is also optimal. We know:

(1)  $V(x^1) = r(x^1) + \beta P(x^1) V(x^1)$

(2)  $V(x^2) = r(x^2) + \beta P(x^2) V(x^2)$

(3)  $V(x^1) = V(x^2)$.

Now consider the strategy which consists of using $x^\lambda$ for the first
period and $x^1$ thereafter. The total reward for such a strategy is:

$$r(x^\lambda) + \beta P(x^\lambda) V(x^1)
 = r[\lambda x^1 + (1 - \lambda) x^2] + \beta P[\lambda x^1 + (1 - \lambda) x^2] V(x^1)$$

$$ = \lambda [r(x^1) + \beta P(x^1) V(x^1)] + (1 - \lambda) [r(x^2) + \beta P(x^2) V(x^1)]$$

$$ = \lambda V(x^1) + (1 - \lambda) V(x^2) = V(x^1),$$

the last two equalities following from (1), (2) and (3). Hence, using
$x^\lambda$ the first period and $x^1$ thereafter achieves the same total
value vector as the optimal strategy $x^1$. Hadley [5] has shown that this
is sufficient to imply that the stationary strategy $x^\lambda$ is also
optimal.

(b) First express an optimal randomized strategy, $x$, as a convex
combination of pure strategies. This can always be done as follows: Let
$e_{k_1 k_2 \cdots k_N} = (e_{k_1}, e_{k_2}, \ldots, e_{k_N})$ be the pure
strategy which chooses alternative $k_i$ in state $i$. Then

(4)  $$x = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \cdots \sum_{k_N=1}^{K_N}
x_{1k_1} x_{2k_2} \cdots x_{Nk_N}\; e_{k_1 k_2 \cdots k_N} .$$

To see why this representation is correct, consider the first $K_1$
coordinates of the right-hand side and write as

$$\sum_{k_1=1}^{K_1} x_{1k_1} \Big[ \sum_{k_2=1}^{K_2} \cdots \sum_{k_N=1}^{K_N}
x_{2k_2} \cdots x_{Nk_N} \Big] e_{k_1} .$$

Since

$$\sum_{k_2=1}^{K_2} \cdots \sum_{k_N=1}^{K_N} x_{2k_2} \cdots x_{Nk_N} = 1 ,$$

this sum becomes

$$\sum_{k_1=1}^{K_1} x_{1k_1} e_{k_1} = (x_{11}, x_{12}, \ldots, x_{1K_1}) = x_1 .$$

Similar statements can be made about components $\sum_{j=1}^{i} K_j + 1$
through $\sum_{j=1}^{i+1} K_j$, $i = 1, \ldots, N-1$. Having shown $x$ can
be written as a convex combination of pure strategies, let us simplify the
notation of (4) by writing $x = \sum_{j=1}^{M} \lambda_j e^j$ where
$\{e^j,\ j = 1, \ldots, M\}$ are those pure strategies
$e_{k_1 k_2 \cdots k_N}$ with nonzero coefficients in (4) and
$\{\lambda_j,\ j = 1, \ldots, M\}$ are the corresponding coefficients. We
want to show that all the $e^j$ are optimal pure strategies. Suppose that
$e^1$ is not optimal. Then there exists a state $i$ for which

(5)  $r_i(e^1) + \beta P_i(e^1) V(x) < r_i(x) + \beta P_i(x) V(x)$

because if not, $\ge$ would hold in (5) for all $i$ which would imply, by
the same argument used in (a), that $e^1$ was also optimal. Since $x$ is
optimal, we must also have

(6)  $r_i(e^j) + \beta P_i(e^j) V(x) \le r_i(x) + \beta P_i(x) V(x)$,  $j = 2, \ldots, M$.

Now multiply (5) by $\lambda_1$ and the equations of (6) by $\lambda_j$ and
sum to get
$r_i(x) + \beta P_i(x) V(x) < r_i(x) + \beta P_i(x) V(x)$, a contradiction.
Hence, each $e^j$ is optimal.

Theorem: 

An  equilibrium  point  in  stationary  strategies  exists  for  the  discounted 
case  of  a  nonzero-sum  stochastic  game. 


Proof : 

Lemmas 1 and 2 imply that the nonempty $\theta_1 \times \theta_2$ has a
closed graph and is convex. Kakutani's fixed point theorem can now be
applied to infer that $\theta_1 \times \theta_2$ has a fixed point and hence
that the game has an equilibrium point in stationary strategies.


CHAPTER 3


AVERAGE RATE OF RETURN CASE

3.1  Introduction 

The development of the average rate of return case basically follows
that of the discounted case. The existence of an equilibrium point in
stationary strategies is established with the aid of a linear programming
formulation of the problem and assumptions on the type of Markov chain that
can underlie the motion of the players. The latter consideration leads to
the study of three cases corresponding to the nature of the underlying
chains: irreducible chains, chains with a single ergodic subchain, and
chains with multiple ergodic subchains.

Once again, the analysis depends crucially on the fact that when one
player uses some given strategy, the other is faced with a sequential
decision process. The properties of the set of strategies that optimally
answer a fixed strategy of an opposing player are used to justify the use
of a fixed point theorem in order to establish the existence of an
equilibrium point.

3.2  Multiple  Chain  Case 

The objective of maximizing average rate of return in a sequential
decision process is the player's desire to find a strategy $x$ so as to
maximize the column vector $G(x)$ whose ith component, $G_i(x)$, is the
average reward per period when the initial state of the system is $i$ and
strategy $x$ is employed every period. If $G_i^*$ is defined to be the
average reward per period, over all future time, when the initial state of
the system is $i$ and optimal decisions are made at the beginning of every
period, then it is true that there exists a strategy that will achieve
$G_i^*$ uniformly for all $i$. This can be established by showing that for
any randomized strategy, $x$, there is a pure strategy that can achieve an
average reward per period vector at least as great as $G(x)$ and by using
the policy-improvement routine of Howard [7]. Hence, the player's objective
is a realizable one.

Recall that when player II uses some fixed strategy $y$, player I is faced
with a sequential decision process; he will wish to maximize $G(x,y)$, while
player II will seek the maximum over $y \in Y$ of $H(x,y)$, his average rate
of return vector when player I uses some fixed $x$. If $X$ and $Y$ are the
players' strategy spaces and

$$\phi_1(y) = \Big\{ x \;\Big|\; \max_{x \in X} G(x,y) = G(x,y) \Big\}$$

and

$$\phi_2(x) = \Big\{ y \;\Big|\; \max_{y \in Y} H(x,y) = H(x,y) \Big\}$$

we will again have the

Definition:

The pair of strategies $(x^\circ, y^\circ)$ is said to be an equilibrium
point if $x^\circ \in \phi_1(y^\circ)$ and $y^\circ \in \phi_2(x^\circ)$.

An example due to Gillette [4] demonstrates that in the average rate
of return case, stochastic games may fail to have an equilibrium point in
stationary strategies. Gillette's three state example is:

STATE   BIMATRIX GAME AND TRANSITION PROBABILITIES

  1     | 1,-1  (1,0,0) | 0,0   (0,1,0) |
        | 0,0   (1,0,0) | 1,-1  (0,0,1) |

  2     | 0,0   (0,1,0) |

  3     | 1,-1  (0,0,1) |

where each entry $a^i_{k\ell}, b^i_{k\ell}$
$(p^{k\ell}_{i1}, p^{k\ell}_{i2}, p^{k\ell}_{i3})$ is the result of players
I and II choosing actions $k$ and $\ell$ respectively and represents their
immediate rewards and the probability vector governing their transition out
of state $i$.

Notice that once the players are in state 2, they remain there for
all time, and each receives an average rate of return of zero. Similarly,
once the players are in state 3, they are sure to remain there forever
with average rates of return 1 and -1. Hence, the players are only
concerned with their strategy choices in state 1, to be chosen with the
intent of maximizing $G_1(x,y)$ and $H_1(x,y)$.

To show that no equilibrium point exists, note that

$$\phi_1(y) = \begin{cases} \{(0,1;1;1)\} & \text{if } y_{12} > 0 \\ \{(1,0;1;1)\} & \text{if } y_{12} = 0 \end{cases}$$

$$\phi_2(x) = \begin{cases} \{(1,0;1;1)\} & \text{if } x_1 = (0,1) \\ \{ y \mid y_{12} > 0 \} & \text{if } x_1 = (1,0) . \end{cases}$$

Hence, there is no pair of strategies that are mutually optimal responses,
i.e., there is no pair $(x^\circ, y^\circ)$ for which
$x^\circ \in \phi_1(y^\circ)$ and $y^\circ \in \phi_2(x^\circ)$.

The crucial point to note here is the lack of continuity in the optimal
responses of player I to a sequence of strategies of player II. That is to
say, as $y_{12} \to 0$, $x_1 = (0,1)$ is player I's optimal response as long
as $y_{12} > 0$. But for $y_{12} = 0$, player I's optimal response is
$x_1 = (1,0)$, certainly not the limit of a sequence of $(0,1)$'s. This
condition, which arises because of the multiple chain nature of the
underlying Markov chains, is what is preventing the existence of an
equilibrium point. In the next section, we'll see that if the underlying
chains have only a single irreducible subclass of states, the continuity
described above will obtain, and an equilibrium point in stationary
strategies will exist.
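
The discontinuity can be reproduced numerically. The sketch below assumes the state-1 game reads, row by row, (1,-1; stay in state 1), (0,0; move to state 2), (0,0; stay in state 1), (1,-1; move to state 3), a reading that agrees with the best-response sets stated above. It approximates the long-run average by taking a very high power of the transition matrix, which is legitimate here only because the powers of these particular chains converge. All function names are hypothetical.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limiting_matrix(P, squarings=60):
    """Approximate lim P^n by repeated squaring (assumes the limit exists)."""
    M = [row[:] for row in P]
    for _ in range(squarings):
        M = mat_mul(M, M)
    return M


def average_returns_from_state_1(x12, y12):
    """(G_1, H_1) when I plays row 2 w.p. x12 and II plays column 2 w.p. y12."""
    x11, y11 = 1.0 - x12, 1.0 - y12
    P = [[y11, x11 * y12, x12 * y12],   # stay, absorb in state 2, absorb in 3
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
    a = [x11 * y11 + x12 * y12, 0.0, 1.0]  # player I's expected period rewards
    b = [-v for v in a]                    # the example is zero-sum
    row = limiting_matrix(P)[0]
    G1 = sum(p * r for p, r in zip(row, a))
    H1 = sum(p * r for p, r in zip(row, b))
    return G1, H1


# Player I's two pure choices against y12 = 0.01 and against y12 = 0.
g_row2_small, _ = average_returns_from_state_1(1.0, 0.01)
g_row1_small, _ = average_returns_from_state_1(0.0, 0.01)
g_row1_zero, _ = average_returns_from_state_1(0.0, 0.0)
g_row2_zero, _ = average_returns_from_state_1(1.0, 0.0)
```

Against any small positive $y_{12}$ row 2 earns about 1 and row 1 earns 0, while at $y_{12} = 0$ the ranking flips, which is the jump in player I's optimal response discussed above.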

3.3  Irreducible  Chains 

In light of the multiple chain example of the last section, we see that
for a general proof of the existence of an equilibrium point in the average
rate of return case, we must at least restrict ourselves to games where no
matter what pair of strategies, $(x,y)$, is chosen, the Markov chain
determined, $P(x,y)$, has a single irreducible subclass of states. In the
current section, an even more restrictive assumption on the chains will be
made, while in the next section, we will return to the minimal restriction
mentioned above.


Assumption A:

All pairs of pure strategies,
$(e_{k_1 k_2 \cdots k_N}, e_{\ell_1 \ell_2 \cdots \ell_N})$, determine
an irreducible Markov chain.


Note that this assumption is sufficient to guarantee that for any strategy
pair, $(x,y)$, $P(x,y)$ is irreducible. For the rest of this section,
Assumption A is assumed to hold.

Since the chains, $P(x,y)$, are always irreducible, the starting position
of the players is irrelevant with respect to average rate of return, and
$G_i(x,y) = G_j(x,y)$ and $H_i(x,y) = H_j(x,y)$ for all $i$, $j$ will hold
for any strategy pair $(x,y)$. So letting the scalars $g(x,y)$ and $h(x,y)$
be the players' average rates of return, we may write
$g(x,y) = \pi(x,y) \cdot a(x,y)$ and $h(x,y) = \pi(x,y) \cdot b(x,y)$ where
$\pi(x,y)$ uniquely satisfies $\pi(x,y) = \pi(x,y) P(x,y)$,
$\sum_{i=1}^{N} \pi_i(x,y) = 1$, $\pi_i(x,y) \ge 0$.

These inner products have the interpretation of weighted averages of
period rewards, i.e., in the long run, when strategy pair $(x,y)$ is used, a
proportion $\pi_i(x,y)$ of the transitions are made through state $i$ and
each time such a transition occurs, expected rewards $a_i(x,y)$ and
$b_i(x,y)$ are earned. We can now simplify $\phi_1$ and $\phi_2$ as

$$\phi_1(y) = \Big\{ x \;\Big|\; \max_{x \in X} g(x,y) = g(x,y) \Big\}$$

$$\phi_2(x) = \Big\{ y \;\Big|\; \max_{y \in Y} h(x,y) = h(x,y) \Big\}$$

and attempt to show that $\phi_1$ and $\phi_2$ have closed graphs and are
convex.
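
For an irreducible chain, the average rates of return $g = \pi \cdot a$ and $h = \pi \cdot b$ can be computed by solving the linear system $\pi = \pi P$, $\sum_i \pi_i = 1$. The sketch below does this with a small Gaussian elimination, replacing one redundant balance equation by the normalization. The function names and the two-state chain are hypothetical.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def stationary_distribution(P):
    """Unique pi with pi = pi P and sum(pi) = 1 (P assumed irreducible)."""
    n = len(P)
    # Row i encodes sum_j pi_j * P[j][i] - pi_i = 0.
    M = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    M[-1] = [1.0] * n  # replace one redundant equation by the normalization
    return solve(M, [0.0] * (n - 1) + [1.0])


def average_rate(pi, rewards):
    """Inner product of the stationary distribution with period rewards."""
    return sum(p * r for p, r in zip(pi, rewards))


# Hypothetical irreducible chain P(x, y) and reward vectors a(x, y), b(x, y).
P = [[0.5, 0.5], [0.25, 0.75]]
pi = stationary_distribution(P)
g = average_rate(pi, [3.0, 0.0])
h = average_rate(pi, [0.0, 3.0])
```

With this chain the stationary distribution is (1/3, 2/3), so $g = 1$ and $h = 2$ regardless of the starting state.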

Once again, since a player opposing an opponent's fixed strategy is facing
the usual sequential decision process, we can concentrate our attention on
the set of optimal strategies for a sequential decision process, i.e., those
$x$ which solve the nonlinear programming problem

(1)  Maximize  $g(x) = \pi(x) r(x)$

     Subject to  $\sum_{k=1}^{K_i} x_{ik} = 1$,  $i = 1, \ldots, N$

                 $\pi(x) = \pi(x) P(x)$

                 $\sum_{i=1}^{N} \pi_i(x) = 1$

                 $x,\ \pi(x) \ge 0$

Manne [12] showed that (1) is equivalent to a linear program, which in turn
was shown by Wolfe and Dantzig [16] to be equivalent to the generalized
linear program

(2)  Maximize  $rz$

     Subject to  $\sum_{i=1}^{N} Q_i z_i = \begin{pmatrix} 0_N \\ 1 \end{pmatrix}$

                 $z \ge 0$

where $0_N$ is an N-vector of zeroes and, for $i = 1, \ldots, N$, $Q_i$ is a
column in the convex polyhedron $\mathcal{Q}_i$ generated by the extreme
points

$$Q_{ik} = (p^k_{i1}, \ldots, p^k_{ii} - 1, \ldots, p^k_{iN}, 1), \qquad k = 1, \ldots, K_i ,$$

and if

$$Q_i = \sum_{k=1}^{K_i} \lambda_{ik} Q_{ik} \quad \text{with} \quad
\sum_{k=1}^{K_i} \lambda_{ik} = 1,\ \lambda_{ik} \ge 0 ,$$

then

$$r_i = \sum_{k=1}^{K_i} \lambda_{ik} r_{ik} .$$

Problem  (2)  can  be  arrived  at  as  a  direct  consequence  of  seeking 

randomized  optimal  strategies.  The  key  is  to  recognize  that,  when  in  say, 

state  i  ,  choosing  alternative  k  with  probability  •  ••»  • 

is,  in  terms  of  expectations,  the  very  same  thing  as  engaging  in  the  single 

alternative  represented  by  the  appropriate  probability  mixture  of 

"pure"  actions.  This  is  simply  a  restatement  of  the  second  paragraph  of 

k 

Section  2.3.  Hence,  if  is  the  probability  vector  governing  transitions 

out  of  state  i  if  alternative  k  is  chosen  there,  and  r^  ls  the 

associated  immediate  expected  reward,  then  employing  the  randomized  strategy 
x  corresponds  to  choosing  a  single  probability  vector 


P^x) 


Ki  „ 

I  » *ik 

k-1  1 


in  the  convex  polyhedron  P^  generated  by  the  P^'s  *  to  govern  transitions 
out  of  state  i  and  to  earning  the  immediate  expected  reward 


'  J,  rikxik  • 
k-1 


Thus,  suppressing  the  x's  ,  if  Pj  e  P  , i  -  1 ,  . .  .  ,  N  determine  the  Markov 
chain  P  -  (P^,  ...,  P^)  that  governs  state  transitions,  the  associated 
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are  precisely  the  same  weights  on  the  extreme  points  of  P^  needed  to 
determine  the  probability  transition  vector  resulting  from  strategy  x  ,  i.e., 
column  selection  from  is  equivalent  to  specifying  weights  on  the 

extreme  points  of  which  is  the  same  as  specifying  weights  on  the  extreme 

points  of  Pj  ,  an  operation  identical  to  choosing  a  randomized  strategy. 

This establishes a 1-1 correspondence between solutions to (2) and our original problem: if a randomized strategy, $x$, results in an average rate of return $g(x)$, then the columns

$$Q_i(x) = \sum_{k=1}^{K_i} x_{ik} Q_{ik}$$

and $z$ solving $Q(x)z = \binom{0_N}{1}$ result in a value of the objective $rz = g(x)$ in (2).

Similarly, a set of columns $Q_i$, chosen as a feasible solution to (2), can be expressed as convex combinations of the extreme points of the $C_i$. If we let the weight on $Q_{ik}$ necessary to express $Q_i$ be the probability with which alternative $k$ is chosen in state $i$, a strategy will result with average rate of return equal to the value of the objective in (2) for this corresponding feasible solution.
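As a numerical illustration of this correspondence, the sketch below takes the chain and rewards induced by some fixed strategy, solves the balance equations together with the normalization row, and evaluates the objective. It is a minimal sketch under assumptions: the sign convention (entries $\delta_{ij} - p_{ij}$), the choice to drop the last balance row, the helper names `solve` and `stationary_and_gain`, and the data are all illustrative rather than taken from the text.

```python
from fractions import Fraction as F

def solve(M, b):
    """Solve the square system M u = b exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def stationary_and_gain(P, r):
    """Return (z, r.z), where z solves the balance equations with sum(z) = 1."""
    n = len(P)
    # Balance row j: sum_i z_i * (delta_ij - P[i][j]) = 0; the last one is dropped.
    rows = [[(1 if i == j else 0) - P[i][j] for i in range(n)]
            for j in range(n - 1)]
    rows.append([1] * n)  # normalization row: the z_i sum to one
    z = solve(rows, [0] * (n - 1) + [1])
    return z, sum(zi * ri for zi, ri in zip(z, r))

# Hypothetical irreducible two-state chain and rewards under a fixed strategy.
P = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
r = [F(3), F(6)]
z, g = stationary_and_gain(P, r)
```

For these data `z` is the stationary vector of `P`, and `g` is the average reward per period earned by the strategy that induces `P` and `r`.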

For any strategy $x$, the rank of $Q(x)$ is less than $N + 1$ since the first $N$ rows sum to zero. In fact, the rank of $Q(x)$ is $N$, as can be seen by considering the $N \times N$ matrix $\bar{Q}(x)$ obtained by arbitrarily deleting one of the first $N$ rows of $Q(x)$. Assuming $\bar{Q}(x)$ is not of rank $N$ implies there exists $\rho \ne 0$ such that $\bar{Q}\rho = 0$ (suppressing the $x$'s). Because $P$ is irreducible, there is a unique nonzero bounded solution $\pi$ to

(3)  $\bar{Q}\pi = \binom{0_{N-1}}{1}$,

more familiarly written $\pi = \pi P$, $\sum_{i=1}^{N} \pi_i = 1$ (Chung [1]). A contradiction now occurs since $\sigma = \pi + \rho$ will also be a nonzero bounded solution to (3): the row deleted from $Q$ can be written as minus the sum of the first $N - 1$ rows of $\bar{Q}$, and the inner product of each of the first $N - 1$ rows of $\bar{Q}$ with both $\pi$ and $\rho$ is zero. The last row of (3) is satisfied by $\sigma$ since the elements of $\pi$ sum to one and those of $\rho$ to zero. This latter fact also assures us that $\sigma$ cannot be zero.

We are now prepared to prove the existence of an equilibrium point for the average rate of return case with irreducible chains. Following Section 2.2 again, we will need to show the convexity and closed graph properties of $\Phi_1$ and $\Phi_2$.

Lemma 1:

$\Phi_1 : Y \to 2^X$ and $\Phi_2 : X \to 2^Y$ have closed graphs.

Proof:

A sufficient condition for $\Phi_1$ to have a closed graph is the continuity of $g(x,y) = \pi(x,y) r(x,y)$, where $\pi(x,y)$ is the stationary vector of the Markov chain $P(x,y)$, and $r(x,y)$ is the vector of immediate expected rewards when strategy pair $(x,y)$ is employed. Since the elements of $r(x,y)$ are just bilinear terms in the $x_{ik}$ and $y_{i\ell}$, we only have to demonstrate the continuity of $\pi(x,y)$, where $\pi(x,y)$ solves

$$\pi(x,y) = \pi(x,y)P(x,y), \quad \sum_{i=1}^{N} \pi_i(x,y) = 1,$$

or, recalling the notation of Problem (2), $Q(x,y)\pi = \binom{0_N}{1}$. As a consequence of the result on the rank of $Q(x,y)$, we can arbitrarily delete one of the first $N$ rows of $Q(x,y)$ to form $\bar{Q}(x,y)$ and write

(4)  $\pi = \bar{Q}^{-1}(x,y)\binom{0_{N-1}}{1}$.

We are always assured of the existence of $\bar{Q}^{-1}(x,y)$ for all $(x,y)$. The elements of $\bar{Q}^{-1}(x,y)$ are just ratios of polynomials involving the $x_{ik}$ and $y_{i\ell}$, so that the (unique) solution to (4) is, in fact, a continuous function of $(x,y)$. An identical argument on $h(x,y)$ yields the closed graph nature of $\Phi_2$.

Lemma 2:

$\Phi_1(y)$ can be characterized as a closed convex polyhedron whose extreme points constitute the set of pure strategies that are optimal responses to $y$.

Proof:

As we have remarked several times in the past, when player II fixes his strategy at $y$, player I is faced with a sequential decision process that we can put in the form of Problem (2):

(5)  Maximize $a(y)z$ subject to $Q(y)z = \binom{0_N}{1}$, $z \ge 0$.

Since we will want to deal with a linear program with full rank, from now on, it will be assumed that the $N$th row in (5) is deleted, so that

assumption imply a 1-1 correspondence between strategies $x$ and feasible nondegenerate bases $\bar{Q}(x,y)$.

As in Lemma 2 of Section 2.4, the proof will be accomplished if it is shown that: (a) any convex combination of optimal pure strategies is an optimal randomized strategy and (b) if some randomized strategy is optimal, then it is a convex combination of optimal pure strategies.

(a) Suppose $x^1$ and $x^2$ are pure strategies that are optimal responses to $y$. We want to show that $x^\lambda = \lambda x^1 + (1 - \lambda)x^2$, $0 < \lambda < 1$, is also an optimal response to $y$. The analysis used for the development of Problem (2), and consequently Problem (5), enables us to say that $x^1$ and $x^2$ correspond to the optimal bases $\bar{Q}(x^1,y)$ and $\bar{Q}(x^2,y)$, and $x^\lambda$ to the basis $\bar{Q}(x^\lambda,y) = \lambda\bar{Q}(x^1,y) + (1 - \lambda)\bar{Q}(x^2,y)$. If $\mu_i$ is the $N$-vector of (optimal) simplex multipliers associated with $\bar{Q}(x^i,y)$, $i = 1, 2$, we must have $\mu_1 = \mu_2$. This is so because under nondegeneracy (which is implied here by irreducibility), the optimal multipliers $\mu_1$ must price out the columns in $\bar{Q}(x^2,y)$ zero. (This need not be true if the optimal strategy $x^2$ results in some transient states, for this results in degeneracy. See Section 3.4.) Hence, $a(x^2,y) - \mu_1\bar{Q}(x^2,y) = 0$, which implies $\mu_1 = a(x^2,y)\bar{Q}^{-1}(x^2,y) = \mu_2$. It now follows that $\mu_\lambda$, the simplex multipliers associated with $\bar{Q}(x^\lambda,y)$, must also equal $\mu_1$ since

$$\mu_1\bar{Q}(x^\lambda,y) = \mu_1[\lambda\bar{Q}(x^1,y) + (1 - \lambda)\bar{Q}(x^2,y)] = \lambda a(x^1,y) + (1 - \lambda)a(x^2,y) = a(x^\lambda,y),$$

so that $\bar{Q}(x^\lambda,y)$ must also be an optimal basis and, therefore, $x^\lambda$ an optimal strategy.
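The pricing-out test that drives this argument can be tried on a small example. The sketch below is not the author's procedure; it assumes player II's strategy is already fixed (so only `P[i][k]` and `r[i][k]` remain), adopts the convention that basic column $i$ has entries $\delta_{ij} - p_{ij}$ with the last state's balance row dropped and a final entry of one, and uses hypothetical names (`solve`, `price_out`) and data.

```python
from fractions import Fraction as F

def solve(M, b):
    """Solve the square system M u = b exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def price_out(P, r, policy):
    """Multipliers of the basis picked by a pure policy, and all reduced costs.

    P[i][k][j], r[i][k]: transitions and rewards of alternative k in state i.
    Returns (v, g, reduced) with reduced[i][k] = r_ik - g - v_i + sum_j p_ikj v_j
    and the last relative value fixed at zero.
    """
    n = len(P)
    M, b = [], []
    for i in range(n):
        k = policy[i]
        # Basic column i: v_i - sum_j P[i][k][j] v_j + g = r[i][k].
        M.append([(1 if i == j else 0) - P[i][k][j] for j in range(n - 1)] + [1])
        b.append(r[i][k])
    u = solve(M, b)
    v, g = u[:-1] + [F(0)], u[-1]
    reduced = [[r[i][k] - g - v[i] + sum(P[i][k][j] * v[j] for j in range(n))
                for k in range(len(P[i]))] for i in range(n)]
    return v, g, reduced

# Hypothetical data: two alternatives in state 0, one in state 1.
P = [[[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]], [[F(1, 4), F(3, 4)]]]
r = [[F(3), F(4)], [F(6)]]
v_opt, g_opt, red_opt = price_out(P, r, [1, 0])
v_bad, g_bad, red_bad = price_out(P, r, [0, 0])
```

With these numbers the policy `[1, 0]` prices every column out nonpositively, so it is optimal, while the multipliers of `[0, 0]` leave a positive reduced cost on the unused alternative.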

(b) Suppose $x$ is an optimal randomized strategy with associated basis $\bar{Q}(x,y)$ and optimal simplex multipliers $\mu$. Recall that $x$ can be written as a convex combination of pure strategies:

$$x = \sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N} x_{1k_1} \cdots x_{Nk_N}\, e_{k_1 \ldots k_N}.$$

Therefore,

$$\bar{Q}(x,y) = \sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N} x_{1k_1} \cdots x_{Nk_N}\, \bar{Q}(e_{k_1 \ldots k_N},y), \qquad a(x,y) = \sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N} x_{1k_1} \cdots x_{Nk_N}\, a(e_{k_1 \ldots k_N},y).$$

By definition $\mu$ satisfies $a(x,y) - \mu\bar{Q}(x,y) = 0_N$, i.e.,

$$\sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N} x_{1k_1} \cdots x_{Nk_N} \left[a(e_{k_1 \ldots k_N},y) - \mu\bar{Q}(e_{k_1 \ldots k_N},y)\right] = 0_N.$$

Since we want to show that the coefficients $x_{1k_1} \cdots x_{Nk_N}$ in this sum are positive only if $e_{k_1 \ldots k_N}$ is optimal, we break the sum into two parts, one corresponding to the set, $O$, of actions $k_1, \ldots, k_N$ that are optimal, and another corresponding to the set $D$ of actions that result in nonoptimal bases. Now we have

(6)  $\sum_{(k_1, \ldots, k_N) \in O} x_{1k_1} \cdots x_{Nk_N} \left[a(e_{k_1 \ldots k_N},y) - \mu\bar{Q}(e_{k_1 \ldots k_N},y)\right] + \sum_{(k_1, \ldots, k_N) \in D} x_{1k_1} \cdots x_{Nk_N} \left[a(e_{k_1 \ldots k_N},y) - \mu\bar{Q}(e_{k_1 \ldots k_N},y)\right] = 0_N$

where, by the optimality of $\mu$, $a(e_{k_1 \ldots k_N},y) - \mu\bar{Q}(e_{k_1 \ldots k_N},y) \le 0_N$ for all $k_1, \ldots, k_N$.

But under nondegeneracy, a vector of optimal simplex multipliers prices out all the columns of another basis zero if and only if the other basis is optimal. Hence, the first sum in (6) vanishes (by the "if") and at least one of the elements is negative in every vector $a(e_{k_1 \ldots k_N},y) - \mu\bar{Q}(e_{k_1 \ldots k_N},y)$ in the second sum of (6) (by the "only if"). Therefore, we must have $x_{1k_1} \cdots x_{Nk_N} = 0$ for $(k_1, \ldots, k_N) \in D$ in order to maintain the equality in (6), and the lemma is proven.


Theorem:

Under Assumption A, an equilibrium point in stationary strategies exists for the average rate of return case of a nonzero-sum stochastic game.

Proof:

Identical to the discounted case, with $\Phi_i$ replacing the corresponding correspondence of that case, $i = 1, 2$.


3.4  Chains with a Single Ergodic Subchain


Having dealt with the two extreme cases of finite Markov chains in the last two sections, we are now left with the "in-between" case of a Markov chain that allows for some transient states, but only one irreducible subset of states. Such a chain may be taken to look like

[block-partitioned transition matrix; display not legible in the source]

where every row of $A_1$ has at least one positive element and the subchain $P$ is irreducible. (The previous section assumed $A_1$ vacuous.) We will make an assumption analogous to that of the last section and then show that equilibrium points still exist on this middle ground, although the proofs of Section 3.3 must be modified.


Assumption B:

All pairs of pure strategies, $(e_{k_1 \ldots k_N}, e_{\ell_1 \ldots \ell_N})$, determine a Markov chain with a single ergodic subchain.

This assumption is sufficient to guarantee that for any strategy pair $(x,y)$, $P(x,y)$ has a single ergodic subchain. For the rest of this section, Assumption B is assumed to hold.
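Whether a given transition matrix satisfies the structural condition in Assumption B can be checked mechanically. The sketch below is an illustration only: it reads "single ergodic subchain" as "exactly one closed communicating class" (an assumption about the intended meaning), and the function names and matrices are hypothetical.

```python
def closed_classes(P):
    """Return the closed communicating classes of a finite transition matrix."""
    n = len(P)
    # reach[i] = set of states reachable from i (including i itself).
    reach = [{i} for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            new = set(reach[i])
            for j in reach[i]:
                new |= {k for k in range(n) if P[j][k] > 0}
            if new != reach[i]:
                reach[i], changed = new, True
    classes = []
    for i in range(n):
        # State i is recurrent iff every state it can reach can reach it back.
        if all(i in reach[j] for j in reach[i]):
            cls = frozenset(reach[i])
            if cls not in classes:
                classes.append(cls)
    return classes

def has_single_ergodic_subchain(P):
    """True when the chain has exactly one closed communicating class."""
    return len(closed_classes(P)) == 1

# State 2 is transient and feeds the irreducible block {0, 1}.
with_transient = [[0.5, 0.5, 0.0], [0.25, 0.75, 0.0], [0.2, 0.3, 0.5]]
# Two absorbing states give two closed classes.
two_classes = [[1.0, 0.0], [0.0, 1.0]]
```

To test Assumption B itself, one would apply `has_single_ergodic_subchain` to the chain induced by each pair of pure strategies.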

Once again, the initial state of the players will have no bearing on their average rates of return, which can still be expressed as $g(x,y) = \pi(x,y)a(x,y)$ and $h(x,y) = \pi(x,y)b(x,y)$ since the solution to $\pi(x,y) = \pi(x,y)P(x,y)$, $\sum_{i=1}^{N} \pi_i(x,y) = 1$, $\pi(x,y) \ge 0$ remains unique. The generalized linear programming approach (Problem (5) of Section 3.3) can also be used again. The proof that the rank of $\bar{Q}(x,y)$ is $N$ given in the last section also holds under Assumption B and sustains the validity of Lemma 1 and the 1-1 correspondence between strategies and feasible bases. However, the fact that a basis $\bar{Q}(x,y)$ may now be degenerate (corresponding to transiency in the chain $P(x,y)$) leads to the breakdown of the convexity of $\Phi_1(y)$ proved in Lemma 2. Specifically, $\mu$ may be a vector of simplex multipliers associated with an optimal basis, but may fail to price out all the columns of another optimal basis zero, since under degeneracy, dual feasibility of $\mu$ is sufficient but not necessary for optimality. Examples of the nonconvexity of $\Phi_1(y)$ in the presence of transient states are easily constructed.

At this point, the natural thing to do is to turn to a generalization of Kakutani's fixed point theorem that would weaken the convexity requirement on $\Phi_1(y)$, since, as remarked above, the closed graph property of $\Phi_1(y)$ still obtains under Assumption B. Such a generalization has been given by Debreu [2] in an adaptation of a fixed point theorem of Eilenberg and Montgomery [3]. Here, the convexity requirement is replaced by the requirement that $\Phi_1(y)$ be contractible, the topological equivalent of convexity. Schweitzer [14] has shown that $\Phi_1(y)$ may be represented as a convex set with convex protuberances, but such a set may in general fail to be contractible, and it is not clear how to use the properties of the special structure at hand to show that $\Phi_1(y)$ is, in fact, contractible. However, Schweitzer's decomposition of $\Phi_1(y)$ leads to another consideration which results in a way that circumvents the current difficulty.

Suppose we denote by $\psi_1(y)$ that subset of $\Phi_1(y)$ that is the convex "core" of $\Phi_1(y)$, i.e., $\Phi_1(y) - \psi_1(y)$ composes the protuberances of $\Phi_1(y)$. Convexity will no longer be a problem if we can show that $\psi_1(y)$ has a closed graph, for then we can still use Kakutani's theorem to prove the existence of a fixed point for the product correspondence $\psi_1(y) \times \psi_2(x)$. This will be good enough to insure the existence of an equilibrium point because elements of $\psi_1(y)$ (and $\psi_2(x)$) are still optimal. Interestingly enough, in Schweitzer's decomposition of $\Phi_1(y)$, the convex set from which convex protuberances emanate is the set of all optimal strategies $x$ with associated basis $\bar{Q}(x,y)$ that determine simplex multipliers $\mu(x,y) = a(x,y)\bar{Q}^{-1}(x,y)$ which are dual feasible, i.e., price out all the extreme points (and hence, all columns) of $C_i(y)$ nonpositively for all $i$.

More formally, define:

$$\psi_1(y) = \left\{\, x \;\middle|\; \max_{\hat{x} \in X} g(\hat{x},y) = g(x,y),\ \ a_{ik}(y) - \mu(x,y)\bar{Q}_{ik}(y) \le 0 \ \ \forall\, i, k \,\right\}$$

where $\mu(x,y) = a(x,y)\bar{Q}^{-1}(x,y)$.

$\psi_2(x)$ is analogously defined for the generalized linear program that arises when Player II has to find an optimal response to Player I's use of $x$.
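A small degenerate example may help show why the dual feasibility requirement in this definition matters. The sketch below is a constructed illustration, not an example from the text: it assumes the opponent's strategy is fixed, uses the convention of entries $\delta_{ij} - p_{ij}$ with the last relative value set to zero, and all names and numbers are hypothetical. State 0 is transient, so both pure policies earn the same average reward, yet only one of them yields dual feasible multipliers.

```python
from fractions import Fraction as F

def solve(M, b):
    """Solve the square system M u = b exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def price_out(P, r, policy):
    """Return (gain, reduced costs) for the basis selected by a pure policy."""
    n = len(P)
    M = [[(1 if i == j else 0) - P[i][policy[i]][j] for j in range(n - 1)] + [1]
         for i in range(n)]
    u = solve(M, [r[i][policy[i]] for i in range(n)])
    v, g = u[:-1] + [F(0)], u[-1]
    reduced = [[r[i][k] - g - v[i] + sum(P[i][k][j] * v[j] for j in range(n))
                for k in range(len(P[i]))] for i in range(n)]
    return g, reduced

def in_core(P, r, policy, best_gain):
    """Optimal gain and nonpositive reduced costs on every column."""
    g, reduced = price_out(P, r, policy)
    return g == best_gain and all(c <= 0 for row in reduced for c in row)

# State 0 is transient: both alternatives jump to the absorbing state 1.
P = [[[F(0), F(1)], [F(0), F(1)]], [[F(0), F(1)]]]
r = [[F(0), F(5)], [F(1)]]
g_a, red_a = price_out(P, r, [0, 0])
g_b, red_b = price_out(P, r, [1, 0])
```

Both policies are optimal (gain 1), but `in_core` accepts only `[1, 0]`, which is the kind of strategy the core is meant to retain.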


Lemma 1:

$\psi_1 : Y \to 2^X$ and $\psi_2 : X \to 2^Y$ have closed graphs.

Proof:

Let $\{y^n\}$ be a sequence of Player II's strategies converging to $y^0$ and $\{x^n\}$ the sequence of corresponding optimal responses of player I, i.e., $x^n \in \psi_1(y^n)\ \forall\, n$, with $x^n \to x^0$. We have to show that $x^0$ is an optimal response to $y^0$. Since $x^n \in \psi_1(y^n)\ \forall\, n$, $a_{ik}(y^n) - \mu(x^n,y^n)\bar{Q}_{ik}(y^n) \le 0\ \forall\, i, k, n$. The continuity of $a_{ik}(y)$, $\mu(x,y)$, and $\bar{Q}_{ik}(y)$ is assured as in Lemma 1 of the last section, with Assumption B crucial here in guaranteeing the existence of $\mu(x^n,y^n) = a(x^n,y^n)\bar{Q}^{-1}(x^n,y^n)$ for all $n$. Hence, $a_{ik}(y^0) - \mu(x^0,y^0)\bar{Q}_{ik}(y^0) \le 0$, giving $x^0 \in \psi_1(y^0)$. $\psi_2$ has a closed graph by the same argument.

Lemma 2:

$\psi_1(y)$ and $\psi_2(x)$ are convex.

Proof:

Theorem 10, Schweitzer [14].


Theorem:

Under Assumption B, an equilibrium point in stationary strategies exists for the average rate of return case of a nonzero-sum stochastic game.

Proof:

Identical to the discounted case, with $\psi_i$ replacing the corresponding correspondence of that case, $i = 1, 2$.


3.5  Extensions 


The above theorem is easily generalized in two directions. The same development follows if we have an n-person stochastic game where an n-tuple of strategies, $s_1^0, \ldots, s_n^0$, one for each player, is an equilibrium point if for any player, say the $i$th, $s_i^0$ maximizes player $i$'s average rate of return when opposing the fixed strategies $s_j^0$, $j \ne i$, of the other $n - 1$ players. We can appropriately define the correspondences $\psi_1, \ldots, \psi_n$, each being a subset of the optimal solutions to a sequential decision process. Consequently, Lemmas 1 and 2 hold for all $\psi_i$, so that $\psi = \psi_1 \times \cdots \times \psi_n$ has a fixed point and an equilibrium point exists.

Another generalization concerns the underlying law of motion governing the players' state transitions. If we assume that the players' joint choice of actions in a state not only determines their immediate expected rewards and transition probabilities, but also specifies a probability distribution of the time to the next transition (that may depend on the state to which a transition is made), then a semi-Markov process underlies the motion of the players (rather than a Markov chain, which results when the above mentioned probability distributions are degenerate at a unit time) and their objectives become maximization of long run average rate of return per unit time. Howard [8] showed that for the one player case of this set-up, an optimal policy for the sequential decision process only depends on the probability distribution of transition times through their first moments.
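For a fixed strategy, the dependence on first moments only can be seen from the standard renewal-reward ratio: average reward per transition divided by average time per transition. The sketch below illustrates that ratio under assumptions not stated in the text: `pi` is the stationary vector of the embedded chain, `r[i]` is the expected reward earned per visit to state `i`, `tau[i]` is the mean holding time in state `i`, and all names and numbers are hypothetical.

```python
from fractions import Fraction as F

def gain_per_unit_time(pi, r, tau):
    """Long-run average reward per unit time of a semi-Markov reward process."""
    reward_per_transition = sum(p * ri for p, ri in zip(pi, r))
    time_per_transition = sum(p * ti for p, ti in zip(pi, tau))
    return reward_per_transition / time_per_transition

pi = [F(1, 3), F(2, 3)]   # stationary vector of the embedded chain
r = [F(3), F(6)]          # expected reward per visit

# Unit holding times recover the ordinary Markov chain case.
g_markov = gain_per_unit_time(pi, r, [F(1), F(1)])
# Longer mean stays in state 0 dilute the reward earned per unit time.
g_semi = gain_per_unit_time(pi, r, [F(2), F(1)])
```

Only the mean holding times enter the computation, so two holding-time distributions with the same means give the same rate for a given strategy.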

In addition, the problem can be formulated as a generalized linear program just as in the Markov chain case. All that is needed is to modify (5) in Section 3.3 by changing the extreme points of $C_i(y)$ to


q;„G)  -  Qik<y) 


'ik<y) 


where

$$\tau_{ik}(y) = \sum_{\ell=1}^{L_i} \tau_i^{k\ell}\, y_{i\ell}$$

and $\tau_i^{k\ell}$ is the mean time spent in state $i$ if the players use pure strategy pair $(e_k, e_\ell)$. Lemmas 1 and 2 remain unchanged for the appropriately modified $\psi_1$ and $\psi_2$, so that the existence theorem also holds for this more general case.


3.6  An  Equivalence  Theorem 


Just as it is possible to pose a sequential decision process as a generalized linear programming problem, it is possible to cast a two-person nonzero-sum stochastic game under Assumption A as a programming problem, although a nonlinear one in this case. Let

$$E(y) = \big(Q_{11}(y), \ldots, Q_{1K_1}(y), \ldots, Q_{N1}(y), \ldots, Q_{NK_N}(y)\big)$$

be the matrix of extreme points of the convex polyhedra $C_i(y)$, $i = 1, \ldots, N$, determined by player II's strategy $y$ and from which player I is to choose columns in problem (5) of Section 3.3. $F(x)$ is analogously defined to be the matrix of extreme columns of the generalized linear program faced by player II when player I uses strategy $x$. Let

$$e(y) = \big(a_{11}(y), \ldots, a_{1K_1}(y), \ldots, a_{N1}(y), \ldots, a_{NK_N}(y)\big)$$

and

$$f(x) = \big(b_{11}(x), \ldots, b_{1L_1}(x), \ldots, b_{N1}(x), \ldots, b_{NL_N}(x)\big)$$

be the vectors of associated rewards.


It is easily seen that the linear program

(1)  Maximize $e(y)w$ subject to $E(y)w = \binom{0_{N-1}}{1}$, $w \ge 0$

is equivalent to Section 3.3's problem (5). Given a solution $w$ to (1) above, we can get a solution to (5) by letting the weight on column $Q_{ik}(y)$ needed to express $Q_i(y)$ be $x_{ik}$ and letting

Lemma:

Under Assumption A, a necessary and sufficient condition for the strategy pair $(x^0,y^0)$ to be an equilibrium point is that there exist $\mu^0$ and $\nu^0$ such that

(i)  $[e(y^0) - \mu^0 E(y^0)]x^0 = 0$

(ii)  $[f(x^0) - \nu^0 F(x^0)]y^0 = 0$

(iii)  $e(y^0) - \mu^0 E(y^0) \le 0$

(iv)  $f(x^0) - \nu^0 F(x^0) \le 0$

(v)  $x^0 \in X$

(vi)  $y^0 \in Y$

Note:

$\mu^0$ and $\nu^0$ must be the vectors of simplex multipliers associated with $x^0$ and $y^0$ and problems (1) and (3) above.

Proof:

The existence of a $\mu^0$ satisfying (i) and (iii) is guaranteed by (2) and the necessary and sufficient conditions for optimality of the nondegenerate (Assumption A) linear programming problem (1). Symmetrically, the existence of a $\nu^0$ satisfying (ii) and (iv) is guaranteed by (4) and the optimality conditions of the linear programming problem (3).

Theorem:

Under Assumption A, a necessary and sufficient condition for the strategy pair $(x^0,y^0)$ to be an equilibrium point is that $x^0$, $y^0$, and some $\mu^0$, $\nu^0$ solve the nonlinear programming problem

(*)  Maximize $[e(y) - \mu E(y)]x + [f(x) - \nu F(x)]y$ †
     subject to $e(y) - \mu E(y) \le 0$,
                $f(x) - \nu F(x) \le 0$,
                $x \in X$,
                $y \in Y$.
† Letting $\delta_{ij} = 0$ for $i \ne j$ and $\delta_{ij} = 1$ for $i = j$, the objective can be written as an explicit sum over $i$, $k$, $j$, and $\ell$ in the variables $x_{ik}$, $y_{i\ell}$, $\mu_j$, and $\nu_j$.

Proof:

Sufficiency: Let $x^0$, $y^0$, $\mu^0$, $\nu^0$ solve (*). We will show that (i) through (vi) in the lemma hold. Feasibility guarantees (iii) through (vi) and

(5)  $[e(y^0) - \mu^0 E(y^0)]x^0 + [f(x^0) - \nu^0 F(x^0)]y^0 \le 0$.

But by the existence theorem of Section 3.3, there exist $x$, $y$, $\mu$, $\nu$ satisfying (i) through (vi). Hence, there exists a feasible solution to (*) with the value of the objective equal to zero, so that equality must hold in (5). Finally, equality in (5) and (iii) through (vi) imply (i) and (ii) are satisfied. Now we can apply the sufficiency part of the lemma to infer that $(x^0,y^0)$ is an equilibrium point.

Necessity: Let $(x^0,y^0)$ be an equilibrium point with associated simplex multipliers $\mu^0$ and $\nu^0$. Then (i) through (vi) hold, implying $x^0$, $y^0$, $\mu^0$, $\nu^0$ are feasible for (*) and, in fact, solve (*) since zero is achieved for an objective that is nonpositive for all feasible solutions to (*).
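For a single state the conditions of the lemma can be checked directly on a bimatrix game. The sketch below rests on an assumed reading of the one-state case: $E(y)$ collapses to a row of ones, $e(y) = Ay$, $f(x) = xB$, and the multipliers are the scalars $\mu = xAy$ and $\nu = xBy$. The function name and the payoff matrices are hypothetical.

```python
from fractions import Fraction as F

def check_equilibrium(A, B, x, y):
    """Return (feasible, mu, nu, objective) for strategies x, y in game [A, B].

    feasible  : conditions (iii) and (iv), i.e. no pure deviation pays more
    objective : value of the objective of (*) at (x, y, mu, nu)
    """
    rows, cols = len(A), len(A[0])
    Ay = [sum(A[k][l] * y[l] for l in range(cols)) for k in range(rows)]  # e(y)
    xB = [sum(x[k] * B[k][l] for k in range(rows)) for l in range(cols)]  # f(x)
    mu = sum(x[k] * Ay[k] for k in range(rows))   # player I's expected payoff
    nu = sum(xB[l] * y[l] for l in range(cols))   # player II's expected payoff
    feasible = all(v <= mu for v in Ay) and all(v <= nu for v in xB)
    objective = (sum(x[k] * (Ay[k] - mu) for k in range(rows))
                 + sum((xB[l] - nu) * y[l] for l in range(cols)))
    return feasible, mu, nu, objective

# Hypothetical coordination game with two pure and one mixed equilibrium.
A = [[2, 0], [0, 1]]
B = [[1, 0], [0, 2]]
mixed = check_equilibrium(A, B, [F(2, 3), F(1, 3)], [F(1, 3), F(2, 3)])
mismatch = check_equilibrium(A, B, [1, 0], [0, 1])
```

With this choice of multipliers the objective is zero by construction, so the substantive test is feasibility: `mixed` passes, while `mismatch` fails because player I gains by deviating.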

Several comments can be made about the equivalence theorem. For a one-state problem ($N = 1$), the theorem reduces to a theorem given by Mangasarian [11] for bimatrix games. Another note of interest is the fact that the average rates of return for players I and II associated with equilibrium point $(x^0,y^0)$ are $\mu_N^0$ and $\nu_N^0$ respectively, a consequence of the duality theorem of linear programming since $\mu^0\binom{0_{N-1}}{1} = \mu_N^0$ and $\nu^0\binom{0_{N-1}}{1} = \nu_N^0$, $\binom{0_{N-1}}{1}$ being the right-hand side of both players' linear programs. Finally, only the sufficiency part of the theorem holds under Assumption B since, under degeneracy, some equilibrium points $(x,y)$ may have associated simplex multipliers that are not dual feasible, i.e., fail to satisfy (iii) and/or (iv).

The complexity of the equivalent nonlinear program indicates that it may be most difficult to find its solutions. An intuitively appealing approach is the following iterative scheme: at the $n$th iteration, $x$ and $y$ are fixed at some $x(n)$, $y(n)$. The associated optimal simplex multipliers, $\mu(n)$ and $\nu(n)$, to the linear programs determined by $x(n)$ and $y(n)$ are then found. Then (*) is solved with $\mu$ and $\nu$ fixed at $\mu(n)$ and $\nu(n)$ respectively. This determines $x(n + 1)$ and $y(n + 1)$ to be used for the $(n + 1)$st iteration. Hopefully, $x(n)$ and $y(n)$ converge to an equilibrium pair. But there is no guarantee that this process will converge to a solution to (*) (and hence an equilibrium point), since the possibility exists that the scheme will get hung up around a set of variables $x$, $y$, $\mu$, $\nu$ where $\mu$ and $\nu$ are simplex multipliers determined by $x$ and $y$, and $x$ and $y$ solve (*) for the fixed $\mu$ and $\nu$. It is interesting to note, however, that to find the (unique) equilibrium point of a zero-sum stochastic game, this procedure works and is precisely the same algorithm as the one given by Hoffman and Karp [6] for determining the value and optimal strategies for zero-sum games.

3.7  Possibilities  for  Further  Research 

Both applied and theoretical problems related to the results given here present possibilities for further research. Many authors have discussed the formulation of an infinite horizon periodic review inventory model as a sequential decision process. The state at the beginning of a period is the inventory on hand, and the actions available to the system operator correspond to the level up to which he orders, while the expected immediate rewards correspond to the expected net revenue for the period: expected sales minus expected ordering, holding, and shortage costs. The probability distribution of demand and a particular choice of order levels determine the transition probabilities that govern state transitions.

Now consider two operators of inventory systems who stock the same item. If a demand is unsatisfied by the first operator, it is reasonable to assume that this demand may revert to the second operator rather than be backordered with the first, thus affecting the demand pattern and reward structure of the second. A similar statement is true about a demand unsatisfied by the second operator. Hence, the policies of the two operators may be considered as a nonzero-sum stochastic game since the reward structure and transition probabilities clearly depend on the operators' joint actions. Rational operators (in the game theoretic sense) of such inventory systems will tend to seek equilibrium operating strategies.
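For the single-operator model, the way demand and an order-up-to level fix the transition probabilities can be sketched as follows. The sketch makes assumptions that the text does not: orders arrive immediately, unmet demand is lost rather than backordered or diverted, and the function name and demand distribution are hypothetical. Modeling the two-operator game would additionally require a rule for how unmet demand reverts to the competitor.

```python
from fractions import Fraction as F

def order_up_to_row(S, demand_pmf, max_level):
    """Transition probabilities to next-period inventory when ordering up to S.

    demand_pmf[d] = P(demand = d); unmet demand is assumed lost, so the
    next state is max(S - d, 0).
    """
    row = [F(0)] * (max_level + 1)
    for d, p in enumerate(demand_pmf):
        row[max(S - d, 0)] += p
    return row

# Hypothetical demand of 0, 1 or 2 units per period.
demand = [F(1, 4), F(1, 2), F(1, 4)]
row = order_up_to_row(1, demand, 2)
```

Each choice of order-up-to level produces one such row, which plays the role of a pure alternative in the sequential decision process.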

Consideration of such a problem leads directly to two possible extensions of a theoretical nature. A characterization of the set of all equilibrium points (perhaps making use of the equivalence theorem) of a nonzero-sum stochastic game would be helpful in resolving situations where one equilibrium point is preferred by one operator and a second equilibrium point by the other, or a situation where one equilibrium point is better (for both players) than all others. A second extension would deal with the problem of partial state information, i.e., a player has some idea about the state he is in (for example, his inventory level) but lacks total state information (for example, his opponent's inventory level).

Other areas for further work readily follow from the consideration of extensions to basic game theory and sequential decision problems, e.g., co-operative games, various solution concepts, and allowing for a countable number of states and actions.